POST: Using Probabilities in Language Processing
Authors
Abstract
We report here on our experiments with POST (Part of Speech Tagger) to address problems of ambiguity and of understanding unknown words. Part of speech tagging, per se, is a well understood problem. Our paper reports experiments in three important areas: handling unknown words, limiting the size of the training set, and returning a set of the most likely tags for each word rather than a single tag. We describe the algorithms that we used and the specific results of our experiments on Wall Street Journal articles and on MUC terrorist messages.
1. Introduction¹
Natural language processing, and AI in general, have focused mainly on building rule-based systems with carefully handcrafted rules and domain knowledge. Our own natural language database query systems, JANUS 2, Parlance™ and Delphi 4, use these techniques quite successfully. However, as we move from the problem of understanding queries in fixed domains to processing open text for applications such as data extraction, we have found rule-based techniques too brittle, and the amount of work necessary to build them intractable, especially when attempting to use the same system on multiple domains. We report in this paper on one application of probabilistic models to language processing, the assignment of part of speech to words in open text. The effectiveness of such models is well known [DeRose, 1988; Church, 1988; Kupiec, 1989; Jelinek, 1985] and they are currently in use in parsers [e.g. de Marcken, 1990]. Our work is an incremental improvement on these models in two ways: (1) We have
¹ The work reported here was supported by the Advanced Research Projects Agency and was monitored by the Rome Air Development Center under Contract No. F30602-87-D-OO93. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, whether expressed or implied, of the Defense Advanced Research Projects Agency or the United
Related resources
A Comparison of Spectral Methods for Spoken Language Identification
Identifying spoken language automatically is to identify a language from the speech signal. Language identification systems can be divided into two categories, spectral-based methods and phonetic-based methods. In the former, short-time characteristics of speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...
Efficient OCR Post-Processing Combining Language, Hypothesis and Error Models
In this paper, an OCR post-processing method that combines a language model, OCR hypothesis information and an error model is proposed. The approach can be seen as a flexible and efficient way to perform Stochastic Error-Correcting Language Modeling. We use Weighted Finite-State Transducers (WFSTs) to represent the language model, the complete set of OCR hypotheses interpreted as a sequence of ...
Heuristic Approach for Specially Structured Two Stage Flow Shop Scheduling to Minimize the Rental Cost, Processing Time, Set Up Time Are Associated with Their Probabilities Including Transportation Time and Job Weightage
The present paper is an attempt to develop a new heuristic algorithm, find the optimal sequence to minimize the utilization time of the machines and hence their rental cost for two stage specially structured flow shop scheduling under specified rental policy in which processing times and set up time are associated with their respective probabilities including transportation time. Further jo...
A word language model based contextual language processing on Chinese character recognition [7534-22]
The language model design and implementation issue is researched in this paper. Different from previous research, we want to emphasize the importance of n-gram models based on words in the study of language model. We build up a word based language model using the toolkit of SRILM and implement it for contextual language processing on Chinese documents. A modified Absolute Discount smoothing alg...
A word language model based contextual language processing on Chinese character recognition
The language model design and implementation issue is researched in this paper. Different from previous research, we want to emphasize the importance of n-gram models based on words in the study of language model. We build up a word based language model using the toolkit of SRILM and implement it for contextual language processing on Chinese documents. A modified Absolute Discount smoothing alg...
A Reflection on Kristeva's Approach to the Structure of Language
Reaching out to history and subject in terms of meaning variation, Kristeva could show that language cannot simply be a Saussurean sign system. Rather, she went on to delineate that language, beyond signs, is associated with a dynamic system of signification where the ''speaking subject'' is constantly involved in processing. Julia Kristeva, a French critic, psychoanalyst, theoretician, a post-...